Introduction

A rare disease is a pathological condition with low prevalence
and incidence. There are between 6,000 and 8,000 rare diseases. Many
rare diseases are sparsely distributed in some geographic areas and
more frequent in others, for reasons linked to genetic factors, to
environmental conditions that influence the spread of pathogens, and
to life habits. Thalassemia, for example, is a relatively common genetic
disease in the Mediterranean basin (very common in Southern Italy)
and rare in the United States.

A disease or disorder is defined as rare in Europe when it affects
fewer than 5 in 10,000 people.^ One rare disease may affect only a
handful of patients in the EU, and another may touch as many as
245,000. Overall, rare diseases may affect 30 million European Union
citizens. In the United States a rare (or orphan) disease is defined
as having a prevalence of fewer than 200,000 affected individuals.^
Many diseases are much rarer, reaching a rate of one case per 100,000
persons or more.

^http://ec.europa.eu/health-eu/health_problems/rare_diseases/index_en.htm.

^http://www.nlm.nih.gov/medlineplus/rarediseases.html.

Rare disease patients too often face common problems, including 
the lack of access to correct diagnosis, delay in diagnosis, lack of 
quality information on the disease, lack of scientific knowledge of 
the disease, inequities and difficulties in access to treatment and 
care. These problems could be addressed by implementing a
comprehensive approach to rare diseases, by increasing international
cooperation in scientific research, by gaining and sharing scientific
knowledge about all rare diseases, not only the most "frequent" ones,
and by developing tools for extracting and sharing knowledge.

Organizations such as the National Institutes of Health (NIH), the
Office of Rare Diseases Research (ORDR), the National Organization
for Rare Disorders (NORD) and Orphanet provide information to
patients and physicians and facilitate the exchange of information
among the different actors involved in this field by standardizing
clinical terminologies, a key factor in information retrieval and
information exchange.

The ORDR was established in 1993 within the Office of the
Director of the NIH, the federal focal point for biomedical research.
The aim of ORDR is to coordinate and support rare disease research
by responding to research opportunities, providing information, and
promoting international collaboration and interoperation.

Orphanet, on the other hand, was established in 1997 by the
French Ministry of Health (Direction Générale de la Santé)^ and the
Institut National de la Santé et de la Recherche Médicale (INSERM).^
Orphanet maintains a database of information on rare diseases and
orphan drugs for all audiences and aims to contribute to the
improvement of the diagnosis, care and treatment of patients with rare
diseases. It includes a Professional Encyclopedia, which is a
comprehensive collection of review articles on rare diseases,
author-based and peer-reviewed, a Patient Encyclopedia and a Directory
of Expert Services. This Directory includes information on relevant
clinics, clinical laboratories, research activities and patient
organizations.

^http://www.sante.gouv.fr.

^http://www.inserm.fr.

JLIS.it. Vol. 3, n. 1 (Giugno/June 2012)

NORD was founded in 1983 by patients and families who
worked together to get the Orphan Drug Act passed. This legislation
provides financial incentives to encourage the development of
new treatments for rare diseases. The purpose of NORD is to supply
information about rare diseases, referrals to patient organizations
and research grants to all those who have an interest in rare diseases.
It is not a government agency; it is a non-profit voluntary health
agency that exists to serve rare-disease patients and their families.
Its primary sources of funding are contributions and membership fees.


Objective 

The aim of this project is to analyze a specific area of biomedical
terminologies, namely rare disease terms. The representation of rare
disease terms has been analyzed in biomedical terminologies such
as Medical Subject Headings (MeSH), International Classification
of Diseases, 10th revision (ICD-10), Systematized Nomenclature of
Medicine - Clinical Terms (SNOMED-CT) and Online Mendelian Inheritance
in Man (OMIM), leveraging the fact that these terminologies are
integrated in the Unified Medical Language System (UMLS). Both the
overlap among sources and the presence of rare disease terms in the
target sources included in UMLS have been analyzed, working at the
term and concept level.


Material

In this section the resources used in this study are briefly
described: the two sources of rare disease terms (ORDR and
Orphanet), the four target terminologies (ICD, MeSH, OMIM, and
SNOMED-CT) and the UMLS.

The UMLS® is a terminology integration system developed at
the National Library of Medicine. The UMLS Metathesaurus®
integrates almost 160 biomedical vocabularies, including the four
target vocabularies under investigation (ICD-10, MeSH, OMIM and
SNOMED-CT). Synonymous terms from the various source vocabularies
are grouped into one concept. Additionally, the Metathesaurus
records the relations asserted among terms in the source
vocabularies, including hierarchical, associative and mapping relations.
Version 2010AB of the UMLS is used in this study. This version
contains approximately 2.4 million concepts and 40 million relations.

Source terminologies 

The ORDR^ publishes a list of rare diseases. This resource does
not represent any relations among rare diseases, but groups all the
synonyms of a given disorder into a single concept. It maintains a
list of 6,857 rare disease concepts (and 11,803 synonyms) on its Web
site, of which about 800 come with extensive information on resources
relating to questions from the public. The rare disease concepts are
either (1) diseases for which information requests have been made
directly to the Office of Rare Diseases Research, to the Genetic and
Rare Diseases Information Center (GARD), which is funded by the ORDR
and the National Human Genome Research Institute (NHGRI), or to
NHGRI directly; or (2) diseases from various data sources and those
that over the last 10 years have been suggested as being rare. The
purpose of the Rare Diseases and Related Terms list is to facilitate
the distribution of information.

^http://rarediseases.info.nih.gov.

Orphanet^ provides information about 5,954 rare diseases. Orphanet
diseases are organized into a directed acyclic graph. In the
Orphanet database, diseases are linked to external reference
terminologies, such as ICD-10 and OMIM. The Orphanet list of rare
diseases comprises 7,715 concepts. We acquired a list of 7,715
preferred terms and 5,224 synonyms. Additionally, Orphanet shared
with us the correspondence it established between rare disease
concepts and OMIM and ICD-10 codes.

Target terminologies 

The ICD is the international standard diagnostic classification for
all general epidemiological purposes, many health management purposes
and clinical use. It is used to classify diseases and other health
problems recorded on many types of health and vital records, including
death certificates and health records. In addition to enabling the
storage and retrieval of diagnostic information for clinical,
epidemiological and quality purposes, these records also provide the
basis for the compilation of national mortality and morbidity
statistics by World Health Organization (WHO) Member States. The 10th
revision of ICD (ICD-10) is used in this study. It is included in UMLS.

MeSH is a controlled vocabulary developed by the U.S. National
Library of Medicine for the indexing and retrieval of the
biomedical literature, especially in the MEDLINE bibliographic
database. It consists of sets of terms naming some 25,000 descriptors
in a hierarchical structure that permits searching at various levels of
specificity. Version 2011 of MeSH is used in this study. Of note, this
version provides partial coverage for the rare disease terms from
ORDR. MeSH is one of the terminologies in the UMLS.

^http://www.orpha.net.

OMIM is a comprehensive, authoritative, and timely
compendium of human genes and genetic phenotypes developed at Johns
Hopkins University. The full-text, referenced overviews in OMIM
contain information on all known Mendelian disorders and over
12,000 genes. OMIM focuses on the relationship between phenotype
and genotype. Its terminological component, including clinical
synopses, is available through the UMLS.

The Systematized Nomenclature of Medicine - Clinical Terms
(SNOMED-CT) is the world's largest clinical terminology, developed by
the International Health Terminology Standards Development
Organisation (IHTSDO) for use in electronic health records. It covers
most areas of clinical information, such as diseases, findings,
procedures, microorganisms and pharmaceuticals. SNOMED-CT provides a
consistent way to index, store, retrieve, and aggregate clinical data
across specialties and sites of care. It also helps to organize the
content of medical records, reducing the variability in the way data
are captured, encoded and used for clinical care of patients and for
research. The version of SNOMED-CT used in this study is dated
July 31, 2010 and is integrated in the UMLS.

In the remainder of this paper, for simplicity, ORDR and Orphanet
are referred to as the sources and SNOMED-CT, MeSH, OMIM and
ICD-10 as the targets.


Method 

UMLS has been used in various data creation, indexing and
encoding systems. It accomplishes this by conjoining the sets of
synonyms and concept relationships in its multiple constituent
terminologies (Merabti et al.). In this study rare disease terms from
the two sources were mapped to the corresponding UMLS concept(s)
using an exact match or, failing that, a match after normalization.
Normalization abstracts away from such unessential differences as
case, punctuation, inflectional variants (e.g., singular vs. plural)
and stop words in terms:

Ex. Glycogen storage disease type 4 → C0017923 (Exact Match);

Ex. Isolated growth hormone deficiency type 1A → C1849790 (Normalized
String).

Because the terms from ORDR and Orphanet are all expected
to name (rare) disorders, we restricted the UMLS concepts mapped
to disorder concepts through a filter based on the Semantic Group
Disorders (including such semantic types as Disease or Syndrome and
Congenital Abnormality). This simple filter provides some level of
word sense disambiguation.
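
The two-step matching and the semantic filter can be illustrated with a minimal sketch. It is not the tool used in the study: the stop-word list, the naive plural stripping and the three-row toy index are simplifying assumptions. Only the two real CUIs come from the examples above; the strings attached to them and the non-disorder entry C9999999 are made up for illustration.

```python
import re

# Hypothetical stop-word list; the study does not publish the one it used.
STOP_WORDS = {"of", "the", "and", "with", "in", "to", "for", "by", "on"}

# Two semantic types named in the text; the Disorders group contains more.
DISORDER_TYPES = {"Disease or Syndrome", "Congenital Abnormality"}

# Toy stand-in for UMLS rows: (CUI, term string, semantic type).
TOY_UMLS = [
    ("C0017923", "Glycogen storage disease type 4", "Disease or Syndrome"),
    ("C1849790", "ISOLATED GROWTH HORMONE DEFICIENCY, TYPE 1A", "Disease or Syndrome"),
    # Made-up homonym outside the Disorders group, to exercise the filter.
    ("C9999999", "Glycogen storage disease type 4", "Laboratory Procedure"),
]


def normalize(term):
    """Lowercase, drop punctuation and stop words, strip a plural 's'."""
    words = re.sub(r"[^a-z0-9]+", " ", term.lower()).split()
    kept = []
    for word in words:
        if word in STOP_WORDS:
            continue
        # Naive singularization; a real system would rely on a lexicon.
        if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
            word = word[:-1]
        kept.append(word)
    return " ".join(kept)


def map_term(term, rows=TOY_UMLS):
    """Return (set of CUIs, match kind) for one source term."""
    # Keep only disorder concepts before matching.
    disorders = [(cui, s) for cui, s, sty in rows if sty in DISORDER_TYPES]
    exact = {cui for cui, s in disorders if s == term}
    if exact:
        return exact, "exact"
    key = normalize(term)
    normalized = {cui for cui, s in disorders if normalize(s) == key}
    if normalized:
        return normalized, "normalized"
    return set(), "unmatched"


print(map_term("Glycogen storage disease type 4"))
print(map_term("Isolated growth hormone deficiency type 1A"))
```

Trying the exact match first keeps the normalized match as a fallback, so a term that already agrees character for character is never reported as a looser match.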


Results 

The first results of the mapping from the sources to UMLS could 
be summarized in three categories: 

1. Unambiguous concepts 

All the terms of a given concept map to only one Concept
Unique Identifier (CUI):

Ex. ORD00117 (Acrodysostosis) → C0220659 (Acrodysostosis);

Ex. ORPHA001248 (Maxillo-nasal dysplasia) → C0220692 (MAXILLONASAL
DYSPLASIA, BINDER TYPE);

Ex. NORD00312 (Conn Syndrome) → C1384514 (Conn Syndrome).

2. Ambiguous concepts

The majority of the terms of a given concept map to more than
one CUI. There are two sub-categories:

• Ambiguous concepts related to granularity issue:

ORPHAOOOO                                    CUI1 (C0268128)  CUI2 (C0220987)             CUI3 (C0268131)
Oroticaciduria                               Orotic aciduria
Orotic aciduria hereditary                                    Hereditary orotic aciduria
Orotidylic decarboxylase deficiency                                                       Hereditary orotic aciduria, type 2
Uridine monophosphate synthetase deficiency  -                -                           -

Table 1: Example of an ambiguous concept related to granularity issue

As shown in Table 1, three terms of a given Orphanet concept
match three different CUIs and one matches nothing. In this
specific case Orphanet grouped together what SNOMED-CT arranged
in a hierarchy.

Figure 1

• Ambiguous concepts not related to granularity issue:

ORPHA000016                         CUI1 (C0339537)           CUI2 (C1844778)
Blue cone monochromatism            Blue cone monochromatism
Achromatopsia incomplete, X-linked                            Achromatopsia, incomplete, x-linked
Achromatopsia, atypical, X linked
S-cone monochromatism

Table 2: Example of an ambiguous concept not related to granularity issue.

As shown in Table 2, the terms of a given Orphanet concept
match several CUIs, but from the UMLS perspective we have no
additional information: both terms come from OMIM, so no
information about hierarchical relations is available.

3. Unmatched Concepts 

Some terms from the sources have no mapping to the target
sources in UMLS:

• Lateral body wall complex

• Levy-Yeboa Syndrome

A possible explanation is that these diseases are extremely
rare (e.g. Lateral body wall complex, of which approximately
250 cases have been reported in the literature so far) or
recently discovered (e.g. Levy-Yeboa Syndrome, discovered in
June 2006).
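
The three categories can be read as a function of the set of CUIs reached by the terms of one source concept. The sketch below makes that explicit under two assumptions that are not in the study: terms with no mapping are simply ignored when counting CUIs, and the parent links used to test for a granularity issue are an illustrative arrangement of the Table 1 CUIs, not relations taken from SNOMED-CT.

```python
def classify(concept_terms, term_to_cuis):
    """Classify a source concept by the CUIs its terms reach.

    concept_terms: term strings belonging to one source concept.
    term_to_cuis:  dict term -> set of CUIs from a prior mapping step.
    """
    cuis = set()
    for term in concept_terms:
        cuis |= term_to_cuis.get(term, set())  # unmapped terms add nothing
    if not cuis:
        return "unmatched", cuis
    if len(cuis) == 1:
        return "unambiguous", cuis
    return "ambiguous", cuis


def ancestors(cui, parents):
    """All CUIs reachable by following parent links upward."""
    seen, stack = set(), [cui]
    while stack:
        for parent in parents.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_granularity_issue(cuis, parents):
    """True if the CUIs form one ancestor chain, i.e. differ only in specificity."""
    ordered = sorted(cuis, key=lambda c: len(ancestors(c, parents)))
    return all(a in ancestors(b, parents) for a, b in zip(ordered, ordered[1:]))


# Term-to-CUI mapping from Table 1; the fourth term has no entry.
TABLE_1 = {
    "Oroticaciduria": {"C0268128"},
    "Orotic aciduria hereditary": {"C0220987"},
    "Orotidylic decarboxylase deficiency": {"C0268131"},
}
TABLE_1_TERMS = list(TABLE_1) + ["Uridine monophosphate synthetase deficiency"]

# Hypothetical parent links (child -> parents), for illustration only.
ILLUSTRATIVE_PARENTS = {"C0220987": {"C0268128"}, "C0268131": {"C0220987"}}

category, cuis = classify(TABLE_1_TERMS, TABLE_1)
print(category, is_granularity_issue(cuis, ILLUSTRATIVE_PARENTS))
```

When no hierarchical relation is known between the CUIs, as in the Table 2 case where both terms come from OMIM, `is_granularity_issue` returns False and the concept falls in the second sub-category.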


Overall representation in targets 

Figure 2 shows part of the overall representation in the target
sources in the UMLS. Of the 8,435 concepts mapped to UMLS, a good
share is represented in the four targets on which we focused our
attention:

1. MeSH 5,663 (67%)

2. SNOMED-CT 4,192 (50%)

3. OMIM 3,802 (45%)

4. ICD-10 1,029 (12%)
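
The percentages in the list above follow directly from the counts and the total of 8,435 mapped concepts. A short check, using only the figures reported here:

```python
TOTAL_MAPPED = 8435
COUNTS = {"MeSH": 5663, "SNOMED-CT": 4192, "OMIM": 3802, "ICD-10": 1029}


def coverage(counts, total):
    """Percentage of mapped concepts found in each target, rounded to an integer."""
    return {name: round(100 * n / total) for name, n in counts.items()}


for name, pct in coverage(COUNTS, TOTAL_MAPPED).items():
    print(f"{name}: {pct}%")
```

The percentages add up to more than 100 because one concept can be represented in several targets at once.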

As shown in Figure 2, the blank columns represent those sources
that have a very small number of mappings (only one or two). This
is because some of them were created for a specific context, e.g.:

• NANDA nursing diagnoses: definitions & classification (NAN) 

• Ultrasound Structured Attribute Reporting (ULT) 

• Foundational Model of Anatomy Ontology (FMA) 


Overlap among sources 

Figure 3^ shows the overlap among sources. From the ORDR
perspective, 59% of concepts are shared with Orphanet and 13% with
NORD; from the Orphanet perspective, 43% are shared with ORDR and
17% with NORD; and from the NORD perspective, 97% are shared with
ORDR and 92% with Orphanet.
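
The overlap figures are directional: the same set of shared concepts is divided by the size of a different source each time, which is why a small list can be almost entirely contained in a large one while covering only a small part of it. A minimal sketch with made-up concept sets (the real lists are not reproduced here):

```python
def directional_overlap(a, b):
    """Share of the concepts in `a` that also occur in `b`, as a percentage."""
    if not a:
        return 0.0
    return 100 * len(a & b) / len(a)


# Hypothetical CUI sets: a large list and a small list mostly contained in it.
large_source = {f"C{i:07d}" for i in range(1, 11)}   # 10 concepts
small_source = {"C0000001", "C0000002"}              # 2 concepts, both in the large list

print(directional_overlap(small_source, large_source))  # share from the small list's side
print(directional_overlap(large_source, small_source))  # share from the large list's side
```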


^For better details, see http://leo.cilea.it/index.php/ilis/article/downloadSuppFile/4783/5747.

Figure 2: Overlap among sources and representation in targets

Figure 3: Overlap among sources


Additional information for a given concept from sources

Among the objectives of this work we set out to find, where
provided, additional information for the given concepts from the rare
disease lists. After analyzing the representation in the target
sources, we looked in more detail for synonyms and more specific terms
from the target vocabularies. As shown in the example below, for
a given concept common to the starting sources, we found that
SNOMED-CT can provide additional synonyms and descendants:

Cryptococcosis: 

• Torulosis 

• Busse-Buschke's disease 

• European blastomycosis 

• European Blastomycosis 

• Busse-Buschke disease 

Additional synonyms provided by SNOMED-CT: 

• European cryptococcosis

• Infection by Cryptococcus neoformans

• Torula 

Additional descendants provided by SNOMED-CT: 

• Systemic cryptococcosis 

• Cryptococcal gastroenteritis 

• Cryptococcosis associated with AIDS 

• Cryptococcus infection of the central nervous system 

• Disseminated cryptococcosis 

• Hepatic cryptococcosis 

• Mucocutaneous cryptococcosis 

• Ocular cryptococcosis 

• Osseous cryptococcosis 

• Pulmonary cryptococcosis 
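
The Cryptococcosis example amounts to a set difference on synonyms plus a walk down the target hierarchy. The sketch below uses the synonyms listed above; the child links are an illustrative arrangement of three of the listed descendants, since their actual depth in SNOMED-CT is not given here.

```python
def additional_synonyms(source_terms, target_terms):
    """Target synonyms whose case-folded form is absent from the source list."""
    known = {t.casefold() for t in source_terms}
    return sorted(t for t in target_terms if t.casefold() not in known)


def descendants(concept, children):
    """All concepts below `concept` in a child-link hierarchy."""
    found, stack = [], [concept]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found


SOURCE_TERMS = [
    "Cryptococcosis", "Torulosis", "Busse-Buschke's disease",
    "European blastomycosis", "European Blastomycosis", "Busse-Buschke disease",
]
TARGET_TERMS = SOURCE_TERMS + [
    "European cryptococcosis", "Infection by Cryptococcus neoformans", "Torula",
]

# Hypothetical child links, for illustration only.
CHILDREN = {
    "Cryptococcosis": ["Systemic cryptococcosis", "Pulmonary cryptococcosis"],
    "Systemic cryptococcosis": ["Disseminated cryptococcosis"],
}

print(additional_synonyms(SOURCE_TERMS, TARGET_TERMS))
print(descendants("Cryptococcosis", CHILDREN))
```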


Limitations 

In some cases we did not find any corresponding terms or
concepts in UMLS. This is partly because everything is seen from
the UMLS perspective: the target sources organize terms in different
ways, each from its own perspective, and this accounts for the
differences among the vocabularies included in UMLS. We also noticed
that some concepts are not present in UMLS, probably because some
diseases are extremely rare and others have been discovered only
recently. For Orphanet in particular, we may have overestimated the
percentage of unmapped concepts, because its list contains some very
general terms, such as "rare genetic skin disease", whereas the target
sources contain very specific terms, such as "xeroderma pigmentosum".


Conclusion

Rare diseases are insufficiently and inconsistently represented in
medical terminologies. More than 50% of rare disease concepts are
still not aligned. Automatic approaches can be used to create a draft
of the alignment and facilitate the work of domain experts. We found
a good representation in the target sources in UMLS, especially in the
sources on which we focused our attention; we also found additional
information for the rare disease concepts. We will share the results
with the organizations that work in this field in order to enhance
information retrieval. They will review all data under the supervision
of clinical experts. This work could also serve as feedback to UMLS
for those terms that ORDR, Orphanet and NORD grouped together and
UMLS does not.

Abstract: Rare disease patients too often face common problems, including the
lack of access to correct diagnosis, lack of quality information on the disease, lack of
scientific knowledge of the disease, inequities and difficulties in access to treatment
and care. These problems could be addressed by implementing a comprehensive approach
to rare diseases, by increasing international cooperation in scientific research, by gaining
and sharing scientific knowledge about rare diseases and by developing tools for extracting
and sharing knowledge. A significant aspect to analyze is the organization of knowledge
in the biomedical field for the proper management and retrieval of health information.
For these purposes, the sources needed have been acquired from the Office of
Rare Diseases Research, the National Organization for Rare Disorders and Orphanet,
organizations that provide information to patients and physicians and facilitate the
exchange of information among the different actors involved in this field. The present
paper shows the representation of rare disease terms in biomedical terminologies
such as MeSH, ICD-10, SNOMED CT and OMIM, leveraging the fact that these
terminologies are integrated in the UMLS. At a first level, the overlap among sources
was analyzed and, at a second level, the presence of rare disease terms in the target
sources included in UMLS, working at the term and concept level. We found that
MeSH has the best representation of rare disease terms.